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1.0 INTRODUCTION 

The CONTROL DATA® CYBER 205 Computer System is based on the architecture 
of the CYBER 203 and STAR 100 with significant logical and electronic 
advances over those systems. The CYBER 205 uses a second generation 
vector processor that makes use of the latest technology in LSI logic. 
Array and matrix operations can now be performed at rates up to eight 
times faster than those of the CYBER 203 and STAR-100. The performance 
of the data motion instructions (gather, scatter, compress, etc.) has 
been improved up to 32 times faster than those of the previous systems. ' 

The CYBER 205 is designed to perform both conventional and special arith- 
metic used in the simulation of complex physical environments, which 
could previously not be performed, because no computer system was capable 
of delivering the required computational power within a reasonable time 
frame. 

This paper illustrates some of the CYBER 205 hardware capabilities and 
provides an introduction to the programming methodology for the CYBER 205 
vector processor. 
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CYBER 205 ARCHITECTURE 

A summary of the architecture of the Control Data CYBER 205 computer is 
given. The CYBER 205 is a superscale, highspeed scientific computer 
system with segmented scalar functional units, a vector processor 
containing up to four floating-point pipelines and up to four million 
64-bit words of semiconductor memory. Peak performance on the vector 
processor is 800 million 32-bit floating point operations per second for 
linked multiply and add triads. Figure 2-1 is a diagram of the CYBER 
205. 



1 Central Proc 



essor 



The central processor unit consists of three functional areas: 

o Scalar processor 
o Vector processor 
o Input/output 

The scalar processor decodes all instructions from central memory and 
directs vector/string instructions to the vector processor for execu- 
tion. An instruction stack provides buffering for 64 virtually addressed 
64-bit words, which can contain up to 128 32-bit instructions, 64 64-bit 
instructions, or a combination of both. The processor is capable of " 
issuing instructions at. a peak rate of one instruction every 20 
nanoseconds, provided there are no conflicts. T'he scalar processor can 
execute scalar instructions in parallel with most vector instructions if 
there are no memory references generated by the scalar instruction for 
operands. To minimize memory references by scalar instructions, a 
register file consisting of 256 general purpose registers is used to 
maintain single element operands within subprograms. 

The scalar processor contains five independent functional units: 

o Add/Subtract unit 

o Multiply unit 

o Logical unit 

o Single cycle unit 

o Divide/Square Root/ Convert unit 
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Figure 2-1. CYBER 205 
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All functional units are segmented and capable of accepting new operands 

every 20 nanoseconds with the exception of the Divide/Square Root/Convert 

unit which nust complete each operation before a new one can begin. All 

units are capable of being shortstopped. Shortstop is a process by which 

i ' ■ • * 

a result from an arithmetic unit can be returned directly to either input 

of any arithmetic unit without waiting for the result to be stored in the 
register file. 

Figure 2-2 is a diagram of the functional components of the scalar 
processor. Peak performance of the scalar processor Is 50 million 32-bit 
or 64-bit floating point operations per second. 



VECTOR 
PROCESSOR 



SECOEO 



KET: 



TO VECTOR PROCESSOR 

f 



RNS/ 

BRANCH 

UNIT 



J 



PRIORITY |_7 
UNIT rr 



11 



P 



±A 



ASSOCIATIVE 
UNIT 



♦■ DATA OR ADDRESS 
-» CONTROL 



INSTRUCTION 

STACK 

(8 SWORD) 



2l 



INSTRUCTION 

ISSUE 

PIPE 



~T~ 



M 



LA 



LOAD/ 
STORE 
UNIT 



1 



REGISTER 

FILE 

(61X236) 



1 



T 



VECTOR PROCESSOR 



SCALAR FLOATING 
POINT 




I MULT 
(PIPE 




SINGLE 
CYCLE 
PIPE 



DIVIDE/ 
SORT/ 
CONVERT 
UNIT 



Figure 2-2. Functional Components of the Scalar Processor 
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The vector unit consists of the stream unit, the string unit, and the 
segmented vector pipeline units. 

The stream unit receives decoded instructions from the scalar unit and 
controls the data streams between central memory and the vector pipe- 
lines. 

The string unit performs operations on byte and bit vectors which are 
used for logical and control vector operations. 

The vector pipeline units are used for vector add/subtract, multiply, and 
divide/square root operations. For vector addition, subtraction, and 
multiplication, the computer contains one, two, or four 64-bit pipelines. 
The vector floating point arithmetic instructions not only perform the 
defined arithmetic operation, but can perform additional operations 
without any increase in operation time. These additional operations are: 

o The magnitude (absolute value) of the operands from one or both input 
vectors are used. 

o The coefficients of the operands from one input vector are 
complemented before they are used. 

o The coefficients of all positive operands from one input vector are 
made negative before they are used. 

Table 2-1 provides approximate vector timing information for selected 
CYBER 205 instructions based on 64-bit operands. Instructions using 
32-bit operands or linked triadic operations run at twice the speed; 
linked 32-bit triadic operations run at four times the speed. 
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TABLE 2-1. APPROXIMATE TIMING CHART FOR SELECTED CYBER 205 INSTRUCTIONS 

Let N - number of output operands (64-bit) 
Let Z =■ length of control vector 



Vector 
Operation 



Cycles (20 nsec) 
2 Pipes 4 Pipes 

32-Bit 64-Bit 32-Bit 64-Bit 



Add/Subtract 51 + N/4 

Multiply 52 + N/4 

Linked Multiply 84 + N/4 

and Add 

Divide/Square root 80 + N/.61 

Divide/Square root 80 + N/1.22 

Upgrade 

Periodic Gather 

Periodic Scatter 

Compress 

Merge 



51 + N/2 

52 + N/2 
84 + N/2 

80 + N/.32 

80 + N/.64 

39 '+ N/.8 
71 + N/.8 
52 + Z/2 
58 + Z/2 



51 + N/8 

52 + N/8 
84 + N/8 

80 + N/1.22 
80 + N/2.44 



51 + N/4 

52 + N/4 
84 + N/4 

80 + N/.64 
80 4- N/1.28 

39 + N/.8 
71 + N/.8 
52 + Z/4 
58 + Z/4 



Therefore the peak performance on linked triads using 32-bit -operands is 
800 million floating point operations (addition, subtraction, and multi- 
plication) per second "(MFLOPS) for a computer with four pipelines and 400 
MFLOPS for 64-bit operands with four pipelines. 

The Input/output system contains 8 or 16 I/O ports, each 32-bits in 
width. Each I/O port is capable of transferring up to 200 million bits 
per second. Total bandwidth for the input/output system is 3200 million 
bits per second. 
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2.2 Memory 



The memory subsystem is available in sizes of 1 million, 2 million, or 4 
million 64-bit words. Memory words are 78 bits long, providing a 64-bit 
data word and 14 bits for single error correction and double error detec- 
tion (SECDED); seven bits for each 32-bit half word. . Each half million 
words of memory contain 16 memory stacks arranged in eight phased 
banks. Sequential addresses are assigned to different banks by using 
bank phasing. Because the banks are independent, a bank can begin a 
memory cycle before adjacent banks have completed previously initiated' 
cycles. 

Memory is addressable by single bits, 32-bit half words, 8-bit bytes, or 
64-bit full words. 

Memory bandwidth of the CYBER 205 permits maximum transfer rate 
concurrently for all I/O ports, while supporting maximum vector 
processing rates. 
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3.0 VECTOR PROCESSING METHODOLOGY 

f 

Two phases exist during execution of a CYBER 205 vector instruction. The 
first, phase is called the startup phase. During this phase the hardware 
sets up sufficient buffering for the data streams to ensure that no 
memory conflicts occur during the stream phase. Also the segmented 
functional units (pipes) are initially filled and emptied. The time 
consumed by the startup phase varies with the different vector operations 
(multiply, divide, etc.) but is independent of the vector length. ' The 
second phase is called the stream phase. The time spent during this 
phase is directly proportional to the vector length, mainly the number of 
cycles needed to produce one result times the number of result operands. 

The performance of a vector operation is monotonic as a function of • 
vector length. With increasing vector length the startup time becomes 
less significant and the performance becomes asymptotic to the peak 
performance rate. Figure 3-1 and tables 3-1 through 3-5 illustrate the 
performance in millions of floating point operations per second (MFLOPS) 
as a function of the vector length. 

Before we can talk about vector processing on the CYBER 205, we have to 
define a CYBER 205 vector. A CYBER 205 vector is a set of contiguous 
memory locations. For real or integer vectors, the memory locations are 
64-bit words or 32-bit half words; and for bit vectors they are bits. 

An example of a vector is a one-dimensional FORTRAN array: 



I A (1) ) A (2) | } A (N) | 

Therefore, the FORTRAN DO loop: 

DO 1000 I - 1,N 

C (I) - A (I) + B (I) 

1000 CONTINUE 
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Figure 3-1. Approximate MFLOPS Rate as a Function of Vector Length for Vector 
Multiply With 64-bit Operands 
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becomes a vector addition, and the DO loop collapses into one instruction 
on the CYBER 205. 

For the following DO loop: 

DO 1000 I - 1,N,M ' 
C (I) "- A (I) * B (I) 
1000 CONTINUE 

If M « 1, this is a vector operation where vectors A, B, and C are, 
contiguous sets of memory. 

If M > 1, vectors A, B, and C are noncontiguous and therefore do not 
satisfy the definition of a CYBER 205 vector. 

The case of H ) 1, however, occurs frequently in general scientific and 
engineering applications; therefore, the CYBER 205 hardware provides a 
set of Instructions to make noncontiguous data structures contiguous. 
These data motion Instructions allow the • generation of temporary 
contiguous vectors from noncontiguous data structures. In the following 
paragraphs we discuss some of the vectorization methods for noncontiguous 
data structures. 

3.1 Control Store 

Any vector operation associated with a control store uses a bit vector. 
A bit vector is defined as a CYBER 205 vector whose elements consist of a 
contiguous set of bits. Each bit corresponds to an element in a data 
vector. The' control store operation stores the elements of a data vector 
whose corresponding bits in the bit vector are 1 into the corresponding 
positions of a result vector, and leaves the elements untouched whose 
corresponding bits are 0. The control store operation is done 
concurrently with the vector arithmetic operation and does not take 
additional computer time. The following example illustrates a control 
store: 
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M— 3 

DO JOOO I - 1,N,M 

C(I) - Ad) - B(I) 

1000 CONTINUE 

It can be seen from this DO loop that the difference between A(l) and 
B(l) is calculated for every third element. By using the control store 
technique we calculate the difference for every element but store only 
every third result into C(I). To be able to do that we have to define a 
bit vector where every third bit is 1 and the remainder of the bits are 
0. 



bit vector > 1001001001001. 



Note that, independent of M, N arithmetic operations will be done. Thus, 
for M = 3, the peak result rate will be one-third the peak hardware rate. 

The number of redundant operations is a linear function of M and 
efficiency decreases as M increases. Therefore, alternatives • to the 
control store algorithm have to be used for larger values of M, which 
makes the control store an unattractive solution. 

3.2 Vector Compress/Merge 

The compress instruction will cause those elements of a data vector that 
correspond to l's in the bit vector to "be compressed into a temporary 
vector whose length is equal to the number of l's in the bit vector. 

Using the vector compress method on our previous example, we have to exe- 
cute the following instruction sequence: 

compress vector A into Aj 
compress vector B into B_ 

merge vector C with vector C- 

Each one of these steps corresponds to a CYBER 205 machine instruction. 
Cj, Aj., and Bj are contiguous temporary vectors of length 1 + (N-l)/M. 
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TABLE 3-1. APPROXIMATE VECTOR ADD PERFORMANCE Iti MFLOPS 

1 Vector Pipeline 2 Vector Pipelines 4 Vector Pipelines 
N 32-Bit 64-Bit 32-Bit 64-Bit 32-Bit 64-Bit 



25 



19.8 



16.4 



22.7 



19.8 



23.1 



22.7 



50 



32.9 



24.8 39.7 



32.9 



43.9 39.7 



100 



49.5 



33.1 65.8 



49.5 79.4 



65.8 



500 



83.1 



45.4 142.0 



83.1 221.2 142.0 



1,000 
10,000 
50,000 
Asymptote 



90.7 



99.0 



99.8 



100.0 



47.6 166.1 



49.7 196.0 



49.9 199.2 



50.0 200.0 



90.7 284.1 166.1 



99.0 384.3 196.0 



99.8 396.8 199.2 



100.0 400.0 200.0 
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TABLE 3-2. APPROXIMATE VECTOR MULTIPLY PERFORMANCE IN MFLOPS 

1 Vector Pipeline 2 Vector Pipelines 4 Vector Pipelines 
N 32-Bit 64-Bit 32-Bit 64-Bit 32-Bit 64-Bit 



25 19.5 16.2 21.6 19.5 22.7 21.6 

50 32.5 24.5 39.1 32.5 43.1 39.1 

100 49.0 32.9 64.9 49.0 78.1 64.9 

500 82.8 45.3 141.2 82.8 219.3 141.2 

1,000 90.6 47.5 165.6 90.6 281.5 165.6 

10,000 99.0 49.7 195.9 99.0 384.0 ' 195.9 

50,000 99.8 49.9 199.2 99.8 396.7 199.2 

Asymptote 100.0 50.0 200.0 100.0 400.0 200.0 
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TABLE 3-3. APPROXIMATE LINKED VECTOR MULTIPLY AND ADD PERFORMANCE IN MFLOPS 



1 Vector Pipeline 2 Vector Pipelines 4 Vector Pipelines 
N 32-Bit 64-Bit 32-Bit 64-Bit 32-Bit 64-Bit 



25 26.0 22.9 27.8 26.0 28.7 27.8 

50 45.9 37.3 52.1 45.9 55.6 52.1 

100 74.6 54.3 91.7 74.6 104.2 91.7 

500 149.7 85.6 239.2 149.7 342.5 239.2 

1,000 171.2 92.3 299.4" 171.2 478.5 299.4 

10,000 196.7 99.2 ' 387.0 196.7 749.6 387.0 

50,000 199.3 99.8 397.3 199.3 789.4 397.3 

Asymptote 200.0 100.0 400.0 200.0 800.0 400.0 
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TABLE 3-4. APPROXIMATE VECTOR DIVIDE AND SQUARE ROOT PERFORMANCE IN MFLOPS 



1 Vector Pipeline 2 Vector Pipelines 4 Vector Pipelines 
_N 32-Bit 64-Bit 32-Bit 64-Bit 32-Bit 64-Bit 



25 



7.9 



5.3 



10.3 



7.9 



12.5 



10.5 



50 



10.6 



6.4 



15.5 



10.6 



20.8 



15.8 



100 



12.8 



7.1 



20.6 



12.6 



31.1 



21.2 



500 



15.2 



7.8 



27.8 



15.2 



51.1 



29.0 



1,000 



15.6 



7.9 29.1 



15.6 



55.6 



30.5 



10,000 16.0 



8.0 



30.4 



15.9 



60.4 



31.8 



50,000 16.0 



8.0 



30.5 



16.0 



60.9 



31.9 



Asymptote 16.0 



8.0 



30.5 



16.0 



61.0 



32.0 
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TABLE 3-5. APPROXIMATE VECTOR DIVIDE AND SQUARE ROOT 
UPGRADE PERFORMANCE IN MFLOPS 

1 Vector Pipeline 2 Vector Pipelines 4 Vector Pipelines 
N 32-Bit 64-Bit 32-Bit 64-Bit 



25 (not available) 12.5 



10.5 13.9 



12.6 



50 



20.8 



15.8 25.0 



21.0 



100 



31.1 



21.2 41.7 



31.6 



500 



51.1 



29.0 88.0 



53.2 



1,000 
10,000 
50,000 
Asymptote 



55.6 



60.4 



60.9 



60.9 



30.5 102.2 



31.8 119.7 



32.0 121.5 



32.0 121.6 



58.1 



63.4 



63.9 



63.9 
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3.3 Periodic Gather/Scatter 

The periodic gather instruction creates a temporary contiguous data 
vector by transmitting elements from another vector of the same data type 
indexed by a periodic index **M." The periodic scatter instruction is the 
inverse of the gather instruction. • 

Using the gather/scatter method as our example, the following instruction 
sequence has to be generated: 

gather vector A into A_ 
gather vector B into Bj 



C. 



T 



A_ - Bj 



scatter vector C T into vector C 

Cj, Aj, and B T are contiguous temporary vectors. 

Figures 3-2 through 3-4 show the performance results measured in MFLOPS 
of the discussed vectorization methods for noncontiguous vectors. As it 
can be seen from those graphs, the best method chosen depends upon the 
values of M, N, and OPS (number of vector operations). In general, for 
small M's and OPS's, the increased number of redundant operations in the 
control store method reduces the performance. The compress/merge and 
gather/scatter methods result in a better performance in this case. With 
the increase of the number of vector operations, the time spent for the 
data motion operations (compress, merge, gather, and scatter) in 
comparison to the time spent for the vector operations becomes increas- 
ingly less significant. Also, with increasing periodic indexes M, the 
gather/scatter method clearly outperforms the other two methods. The 
periodic gather/scatter method is applied by the automatic vectorizer of 
the CYBER 205 FORTRAN compiler for periodic noncontiguous data ' 
structures. 
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Figure 3-2. Approximate MFLOPS Rates for M - 2 
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Figure 3-3. Approximate MFLOPS Rates for M = 5 
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Figure 3-4. Approximate MFLOPS Rates for M = 10 
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3.4 Random Gather/Scatter 

The random gather Instruction uses the elements of an integer index 
vector as indices to take possibly disparate elements from a source data 
vector and make them contiguous in a temporary vector. 

'-"■ For an example, the following FORTRAN DO loop can be replaced by one 
gather instruction: 

DO 1000 J = 1,N 
B(J) = A(I(J)) 

1000 CONTINUE 

The random scatter instruction is the inverse of the gather instruction. 

'DO 1000 J = 1,N 
A(I(J)) = B(J) 
1000 CONTINUE 
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4.0 MATRIX ADDITION AND MULTIPLICATION 

For two full matrices stored In the same order as contiguous data struc- 
tures, the matrix sum can be evaluated with one Instruction, provided the 
: size Is 255 x 255 or less. - 

For the product of two N x N matrices, the Inner (or DOT) product of the. 
jth row of one matrix and the kth column of the other matrix is formed. 

If one matrix is stored by rows and the other one by columns, the product 

2 

matrix can be formed with N inner products. 

The timing for the inner product instruction is: 

107 + N cycles (each cycle is 20 nsec) 
Therefore, the total time is: 

107N 2 + N 3 . 

If the two matrices are not stored by rows and columns, one must gather 
the appropriate entities. However, there is an algorithm that does not 
require the gather and is asymptotically four times faster than the inner 
product algorithm. This algorithm is called the "outer product" and is 
based on the fact that: 

A ll \ / A i2 \ / Ai, 

+ B 2k J . 1+ .... +B nk | 

Anl / \ ^2/ \ ^n 

If A is stored by columns and B is stored in any fashion, then: 

DO 1000 K - 1,N 
C(1,K,N) - B(1,K) * A(1,1;N). 
DO 1000 L - 2,N 

C(1,K;N) - C(1,K;N) + B(L,K) * A(1,L;N) 
1000 CONTINUE 

will compute matrix C and store it by columns. 
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The time for this algorithm is: 

N 3 /2 + 0(N 2 ) on a 2-pipe machine 
N 3 /4 + 0(N 2 ) on a 4-pipe machine 

Note that the inner loop of the algorithm is a linked triad. Not only is 
this algorithm asymptotically four times faster than the first algorithm, 
but the 0(N ) is much smaller due to the fact that no data motion is 
required. 

The following table shows the necessary data structures to achieve the 
matrix operations listed: 



DATA STRUCTURE 



MEAN VECTOR LENGTH 



CYCLE COUNT 

2-pipe CYBER 205 
A-pipe CYBER 205 



FULL MATRIX 
ADDITION 
C = A + B 



A rows A columns 
B rows B columns 
C rows C columns 



W 



N 2 /2 + 0(1) 
N 2 /4 +0(1) 



FULL MATRIX 

MULTIPLICATION 

C = AB 



A columns A * 
B * B rows 
C columns C rows 



N 



N 3 /2 + 0(N 2 ) 
N 3 /4 + 0(N 2 ) 



Figure 4-1 shows the MFLOPS rates for a matrix multiplication using the 
"outer product" algorithm and assumes the total number of operations to 
be 2N 3 . 
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Figure 4-1. Approximate MFLOPS Rates of NxN Matrix Multiply (64-bit Operands) 
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5.0 CONCLUSION 

The material presented in this paper is only an introduction to the 
capabilities of the CYBER 205. Many syntactic FORTRAN kernels are 
reduced to single instructions due to the vector processing capabilities 
of the system. 

The concept of vector processing introduced in this paper shows that even 
with a vector length of less than 100 elements, the performance is. very 
impressive. More importantly, implicit vectorization of FORTRAN kernels 
is done automatically by the compiler. However, to achieve peak 
performance rates and to make full use of the powerful CYBER 205 
instruction set, vecto r numerica l analysis type algorithms have to be 
developed. 
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ABSTRACT 



The CDC CYBER 200 series computer provides computational capability measured in 
the hundreds of mflops (millions of floating operations pec- second).^ This capability is 
due to its vector processing hardware. 

The goal of this paper is twofold: 

1. To illustrate this hardware capability by a selected set of syntactic kernels 
and ' 

2. To provide an introduction to the applicability of vector processing to some 
basic algorithms of numerical linear algebra. 



1. INTRODUCTION 

Although basic advances in technology have contributed significantly to the ability of 
successive generations of computers to solve problems faster, in the last decade the 
architectural revolution known as vector processing has been most responsible for 
pushing out the frontiers of calculational capability. 

Rather than acting on individual operands, the vector processor acts on assemblages of 
operands (called by various authors vectors, arrays, strings, etc.). This allows the 
processor to configure itself more efficiently to perform a certain operation. 

Let me illustrate with a trivial example. Suppose one thousand floating operands are 
to be added to another thousand floating operands and the results stored in a thousand 
words in memory. The traditional hardware (hereafter termed scalar hardware) must 
execute code somewhat similar if not identical to the following: 

Load, Load, Add, Store, Index and Branch. 



Thus, the operands must be fetched individually from memory into a staging area 
called the register file, then added, then stored back into memory one at a time 
Finally, the decision must be made whether there are any more passes to be made 
through the code sequence. For this simple example, the scalar processor must 
execute five or six thousand instructions. Moreover, when executed in the order given 
above, much of the time various parts of the processor are idle, e.g., while waiting for 
operands to be loaded into the register file, the adder is not doing any useful work 
While it is true that software can alleviate some of this problem by techniques such as 
bottom load - top store code generation, it is still true that the individual instructions 
perform such a small piece of the total algorithm that the processor can not "know" 
how best to overlap its various functional units. In particular, the branch instruction is 
usually several times slower than an arithmetic instruction. Recognizing this problem 
crafty programmers have come up with the technique of loop unrolling. Essentially' 
this amounts to doing several pieces of arithmetic per branch. This not only cuts down 
on the number of branches, but allows a segmented arithmetic functional unit to keep 
more of its segments or stations busy during a given time period. 

Several factors limit this technique in a scalar processor: 

1. the size of the instruction stack which must hold all the instructions 
necessary for one pass. 

2. the ability of the Load/Store unit to fetch and return operands to memory. 

3. the number of registers to act as a staging area. 

The ultimate evolution of this technique is to design hardware that 



1. 



has a high bandwidth (ability to pass data) between the memory and the 
calculational unit to allow simultaneous Load/Store overlapped with 



calculation. 



2. wiU route the proper number of operands without a need for branching as such 
or an explicit staging area in a register file. 

This solution is embodied in the CYBER 200 vector processing hardware. The CYBER 
200 series is an outgrowth of the STAR 100 processor. The reader that is interested in 
more hardware detail may consult (1), (2), and (5). 

Basically, there is one hardware instruction to perform up to 65535 additions 
subtractions, multiplications or divisions. At issue of such an instruction the CYBER 
200 "organizes" itself to carry out the one operation on many operands in the most 
efficient manner. 



The execution of a CYBER 200 vector arithmetic instruction consists of two phases. 
The first phase is the startup "phase. The time consumed by this phase is independent 
of vector length. During this period, sufficient buffering is automatically set up by the 
hardware to ensure no memory conflict among the various streams of data into and out 
of the memory. Also the general purpose segmented functional units (called pipes) are 
initially filled (and emptied). The second phase is the stream phase. The time 
consumed by this phase is directly proportional to the vector length, the constant of 
proportionality being the average number of cycles needed to produce one result 
operand. 

The foregoing should make it clear that performance will be monotonic as a function 
of vector length since the longer the vector, the more time to amortize the startup. 
The asymptotic performance is that performance that would occur if there were no 
startup. _ _ _ 

It should also be clear that the number of instructions issued in a given time period is 
no longer a valid measure of performance for a vector processor. Indeed, the simple 
addition kernel that we considered before can be effectively computed on the CYBER 
200 with one instruction. A more useful measure of performance is the megaflop 
(mflop), the capability to perform a million floating operations in one second. 

In terms of mflops, the state of the art for scalar computing is on the order of 
magnitude of one to ten mflops. By the very nature of scalar computing this number 
varies from kernel to kernel, as well as from programmer to programmer. By contrast, 
consider the following tables of vector performance of the CYBER 200 series. (The 
CYBER 203 numbers represent actual performance data. The CYBER 203E, a CYBER 
203 with LSI vector pipes, is at the conclusion of the design phase at the time of this 
writing. Thus its performance represents expected design goals.) The tables give 
mflop rates for 64 bit arithmetic operations. For 32 bit arithmetic, the CYBER 203 
add rate is doubled, the CYBER 203 multiply rate is quadrupled and all CYBER 203E 
rates are doubled. The bottom lines in each case are asymptotic performance. 



(1.1) 



CYBER 203 



VECTOR 






LENGTH 


ADDITION 


MUUPLICATION 


25 


7.5 


3.4 


50 


13.0 


6.0 


100 


20.7 


9.7 


500 


38.9 


19.0 


1000 


43.8 


21.6 


10000 


49.3 


24.6 


50000 


49.9 


24.9 




50.0 


25.0 



(1.2) 



CYBER 205 VECTOR ADD OR MULTIPLY MEGAFLOP RATE 



VECTOR 
LENGTH 


2 P1PE7 64-BIT 


4 PIPE, 64-BIT 
2 PIPE, 32-BIT 


4 PIPE, 32-BIT 


32 


23.9 


27.1 


29.1 


64 


38.6 


47.8 


54.2 


100 


49.5 


65.8 


78.1 


250 


71.0 


109.6 


150.6 


500 


83.1 


142.0 


219.3 


1000 


90.7 


1664 -- 


284*1 


10000 


99.0 


196.0 


384.3 


50000 


99.8 


199.2 


396.8 




100.0 


200.0 


400.0 



In addition, the CYBER 203E is expected to have linked triad capability as follows. 
Various triadic operations involve two input vector streams and one input scalar 
stream. Such triads will be calculated as one vector operation. Two important 
examples of such triads that we will see in the succeeding sections are 

1. Vector + Scalar* Vector 

2. (Vector + Scalar) *Vector 

The CYBER 203E performance for 64 bit operands on such triads will be 
(1-3) CYBER 205 LINKED TRIAD MEGAFLOP RATE 



VECTOR 




4 PIPE, 64-BIT 




LENGTH 


2 PIPE, 64-BIT 


2 PIPE, 32-BIT 


4 PIPE, 32-BIT 


32 


31.7 


34.4 


36.0 


64 


54.7 


63.4 


68.8 


100 


74.1 


90.9 


102.0 


250 


119.0 


168.9 


213.7 


500 


149.3 


238.1 


337.8 


1000 


170.9 


298.5 


476.2 


10000 


196.7 


386.8 


749.1 


50000 


199.3 


397.3 


.789.3 




200.0 - 


400.0 


800.0 



Thus vector processing even now offers an order of magnitude performance over scalar 
and very shortly will offer two orders of magnitude. Indeed, processors are now being 
designed for the mid 80's with sustainable 1000 mflop performance and peak rates of 
3000 mflops, a full three orders of magnitude over scalar performance. 



One of the fundamental laws of the universe is that you do not get something for 
nothing. This law certainly holds true for vector processing. We shall see in the next 
two sections that some programming sophistication is needed to get the full several 
orders of magnitude performance. Of course, the same law holds true for scalar 
programming. There is a remarkable difference between the performance of "casual" 
FORTRAN and "tuned" code on such sophisticated scalar processors as the CDC 7600 
(or the CYBER 200 for that matter). The difference is that the performance range on 
scalar processors is narrower, i.e., the best performance increase one can expect to 
get is usually a factor of two or three. Now we are talking about orders of magnitude. 
The stakes have gone up. Thus it is crucial to understand how to use the vector 
processing capability intelligently. 



2. VECTOR PROCESSING 



In the introduction we spoke of the programming discipline needed for vector 
processing. A large part of this discipline is the recognition of data structure. To be 
more precise, let us ask the question, "What is a vector on the CYBER 200?" 



(2.1) 



Def. Vector — Contiguous set of memory locations 



For real or integer vectors, the memory locations are words, for complex vectors pairs 
of words, for bit vectors they are bits. 

Up to this point we have been discussing the hardware of vectors. It is time to bring 
software into the picture. Where or how do vectors make their appearance from the 
programmers point of view? 

The most primitive example of a vector is a one -dimensional FORTRAN array: 



Am 


A(2) 


. . . 


A(N) 



Thus, the fundamental FORTRAN DO loop 
(2.2) 



DO 9000 I .» 1,N 

C(l) ' A(l) + B(l) 

CONTINUE 



actually performs the operation of a vector addition: 



Ad) 


A(2) 




A(N) 


+ 


B(1) 


B(2) 




B(N) 


X 


C(1) 


C(2) 




C(N) | 



How does one code the vector addition? 

There are two basic approaches to vectorization. One approach is so called automatic 
vectorization. This means writing traditional scalar FORTRAN and depending on a 
"vectorizer", a software module, to "see" the vector construct. The other approach is 
to provide high level language constructs that directly access the vector hardware. 

It is not the purpose of this paper to debate the pros and cons of these two 
approaches. As any complete system should, the CYBER 200 has both options 
available. 

The present task is to explore the second approach, with the explicit goal of 
delineating a wide range of CYBER 200 vector hardware primitives along with 
concomitant FORTRAN constructs. 

The most basic software construct in the family that we shall persue is the descriptor. 

(2.3) Daf. Descriptor — Pointer to a vector 

The internal format of a descriptor contains the starting address and the length of the 
vector pointed to. There are two FORTRAN syntaxes for the descriptor. 

The explicit descriptor has the form 

(2.4) "array name"("index",-"integer expression") 

which points to the vector whose first element is 



(2.5) 

and whose length is 



"array name'T'index' 



"integer expression". 



Thus A(1;N) is an explicit descriptor pointing to the vector whose first element is A(l) 
and whose length is N. 

There is also a variable type called descriptor which allows one to assemble the 
information for a descriptor and access it by name. Thus the declarative 

(2-6) DESCRIPTOR AD 

informs the compiler that the variable name AD is to be a descriptor. The executable 
statement 

(2-7) ASSIGN AD f A(1;N) 

compiles into code to form the descriptor and store it under the name AD. Since the 
ASSIGN statement is an executable statement, the vector pointed to by AD may be 
dynamically changed. - 

Further details about the FORTRAN descriptor construct may be found in (3) and (5). 

One uses descriptors in artihmetic statements as follows: 

'2*8) descriptor = descriptor op descriptor 

compiles into code to perform op on the vectors pointed to by the descriptors on the 
right hand side and store the result in the vector pointed to by the descriptor on the 
left hand side. 

Bringing all of these facts together, the DO loop (2.2) reduces to 

(2.9) C(1;N) - A(1;N) + B(1;N) 
or equivalently 

(2.10) CD = AD + BD 

if the descriptor variables have been properly initialized. 
Next let us consider two-dimensional arrays. Consider 

(2-1D DIMENSION A(N,N),B(N,N),C(N,N) 

DO 9000 I = 1,N 
DO 9000 J - 1,N 

can = a(j,i) .* b(j,o 

9000 CONTINUE 



It is clear that the inner loop is a "vectorizable" kernel. Indeed, the code 



(2.12) 



DO 9000 I - 1,N 
C(1,I;IM) - A(1,I;N) + B(1,I;N) 
9000 CONTINUE 



eliminates one level of nesting and executes with mean vector length N. 

However, a moment's reflection will reveal that for the purposes of the DO loop (2.11) 
the total data structure involved in the kernel is one vector for each array. Consider 



A(1,1) 


A(2 f 1) 




A<N,1) 


A(1,2) 


. . . 


A(N,2) 


. . . 


A(1,N) | . . . 


A(N,N) 






■1*2 






4 | = N- 














* 





Thus the entire DO loop (2.11) can be reduced to 

(2-13) C(1,1;N»N) = A(1,1;N*N) + B(1,1;N»N) " 

or even 

(2.14) CD - AD + BD 

The vector version (2.13) executes with vector length N 2 . This results in greatly 
improved vector performance. The reader is invited to check this fact, say for N = 100. 

If all the data structures encountered in applications programs were contiguous, our 
discourse would be over. The straightforward descriptor syntax could be used to 
generate vector instructions and all problems would be solved!! 

There are, of course, three main reasons why this scenario is naive. 

1. Data structures are often non -contiguous. 

2. Arithmetic may depend on conditional statements. 

3. Even when vectors exist, their existence may be masked by the scalar 
expression of an algorithm. 

Points #1 and #2 are syntactic in nature, and wiU be discussed in the remainder of this 
section. Point #3 is more semantic in nature and wiU be discussed to a limited extent 
in the next section. 

Let us consider the same double DO loop (2.11) with the following declarative 

(2.15) DIMENSION A(N+1,N+1),B(N+1,N+1),C(N+1,I\I+1) 



While the vectorized version (2.11) still is valid, it would seem that the superior 
solution (2.13) is no longer possible. This lowers the mean vector length from N2 to N. 



A{1.1) A<2,1) . . . A(N,1) 



A(N+1 f 1) 



A(1,2> 



A(N,2) 



A(N+1,2) . . . A(N+1,N+1) 



I = 1- 



I = 2 



not used 



not used 



not used 



Since the majority of data elements of the arrays is active, it seems a shame to lose 
the more effective vector length. Indeed, since each-t>f the three arrays has the same 
activity pattern 



ON 


ON 


. . . 


ON 


OFF 


ON 


ON 




ON 


OFF 




ON 




ON 

___ 


OFF 



if only one could suppress the arithmetic on every (N+l)st element, the higher 
megaflop rate could be restored. 

This brings us to the first vector primitive other than basic arithmetic. 



(2.16) 



Def. Control Store — Given i. A source arithmetic vector pointed to by AD 

AD > A.,, A 2 , A3, . . . . , A N 

Given ii. A source bit vector pointed to by BITD 
BITD > b r b 2 , b 3 b N 

Given iii. A target arithmetic vector pointed to by BD 
BD > B 1 , B 2 . B 3 , B N 



the process of control store stores Aj over Bj whenever bj is T and leaves Bj untouched 
when bj is '0', all at vector speed. 

The control store operation can be done as a separate operation which executes at the 
same rate as vector addition. It is also possible to do the control store as part of a 
vector arithmetic instruction. When done in this fashion, it literally leaves the 
execution time invariant. 

In the FORTRAN environment, the control store can be affected with the syntax 

(2.17) BD =» Q8VCTRL(AD,BITD;BD) 

where AD is a descriptor for the source vector, BITD is a descriptor for the controlling 
bit vector and BD is a descriptor for the target vector. 



When done as part of an arithmetic operation, at the present time, there is a "special 
call syntax" which allows the FORTRAN programmer to imbed META (CYBER 200 
assembly language) in a FORTRAN program. In particular, the DO loop (2.11) with the 
effective declarative (2.15) can be reduced to one FORTRAN "special call syntax" 
statement, namely 

(2.18) CALL Q8ADDNV(„AD„BD,BITD f CD) 

where BITD is a descriptor pointing to a bit vector consisting of 

1.1.1, . . , 1,0,1,1, . .1,0, . . . 0,1,1, . . , i,6 

\» N— •). \-N-\ l^-IM-^l 

and AD.BD.CD are descriptors pointing to the entire arrays as vectors. 

Thus we are able to operate on the non-contiguous data structure with almost fuU 
efficiency. The "extra work" we must perform includes the bit vector construction and 
the redundant arithmetic. 

As for the bit vector, it is possible to construct this particular one with one instruction 
which translates to one line of FORTRAN. This instruction is very efficient, and in 
any case, bit vectors of this type have values which are structure dependent rather 
than value dependent. Hence the values in the bit vector do not change over the 
course of execution. 

As for the redundant arithmetic, intuitively it is clear that the half cycle spent doing 
the addition on each (N+l)st element whose storage to memory is suppressed by the 
controlling bit vector, is a small price to pay to avoid the extra vector startup. 

Quantitatively, let 

S = Startup time for a vector addition 

R = Stream rate for a vector addition 

A,B,C be dimensioned (N+M)x(N+M) 

Then the control store technique is superior to the naive technique of restarting the 
vector whenever 

S/R>M 

Since S/R is between 150 and 200, if Mis "small, as in our little example, clearly the 
control store technique is superior. 
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™L^?wZlT niqUe fa *° USeM * VeCt ° rizi ^ IF test ^Ited arithmetic 

(2.19) Def. Vector Regional - Given |. Two source arithmetic vectors pointed to by AD and BD 

AD > A v a 2 , A 3 A N 



BD > B v B 2 . B„ 



, B 



N 



the statement 
(2.20) 



Given ii. A target bit vector pointed to by BITD 
BITD> b v b 2 , b ?: ...,b N 

BITD - AD.GT.BD 



Consider the following simplified one-dimensional flux tradine model. It is „« iniA H 
that an arrav of fluv pat*»« rwrnv-n t-i xt ■ * i» , * mouei. u is assumed 



(2.21) 



9000 



DO 9000 J-1,N 
FLUX(J) - DT»DXDT(J) 

IF(FLUX(J).GT.X(J)) GO TO 9000 
X(J) - X(J) - FLUX(J) 
CONTINUE 



NOTE THAT THE ORDER OF EVENTS IN SCALAR Is 



j ARITHMETIC | DECISION ■ • . | ARITHMETIC | DECISION . ETC. 



SS «^%^F<S£X£?"T ¥ d ? osition WIth eaoh ° peratlon 

Using implicit descriptors, the vectorized form of this kernel is: 



(2.22) 



FLUXD - DT«DXDTD 
BITD - FLUXD. LE.XD 
CALL Q8SUBNV („XD„FLUXD,BITD,XD) 
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The first line calculates all the fluxes with one vector operation. The second line 
builds the decision structure, i.e., the bits pointed to by BITD contain the outcomes of 
all the IF tests. This is again one vector operation. The third line trades the fluxes 
whenever possible with one vector operation. 

As for efficiency, one assumes in a well behaved model that the percentage of l's in 
the bit vector is high. 

Let me make two final comments on this kernel. First, if one suspects that the model 
is behaving poorly, i.e., the bit vector is mostly O's, one can ascertain very easily how 
many l's there are. There are several efficient alternatives. We will not speak of 
them here. Secondly, the scalar DO loop (2.21) compiles irrto 12003-instructions. The 
vector kernel compiles into 3 instructions. 

While the control store is an invaluable tool, the following kernel iUustrates the need 
for what we shall call "data motion" primitives. 

Suppose it is desired to evaluate a polynomial 

(2.23) 

i 

Y » Z A X k 

k-0 i -k+1 



for various values of X which are evenly spaced within an array. Using Horner 
decomposition, such a kernel might be written J 

(2.24) DO 9000 J«1,N,M 

Y(J) - A(1) * X(J) 

DO 9010 K=2,L 

Y(J) - Y(J) + A{K) 

Y{J) - X(J) * Y(J) 
9010 CONTINUE 

Y(J) - Y(J) + A(L+1) 
9000 CONTINUE 

If M=l, this is clearly a vectorizable kernel: 

(2.25) YD - A(1)»XD 

DO 9010 K=2,L 
YD = YD+A(K) 
YD = XD*YD 
9010 CONTINUE 

YD - YD+A(L+1) 
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If M>1, one can write the above kernel using special call syntax with the appropriate 
bit vector and control store the arithmetic. 

Note that, independent of M, 2LN arithmetic operations will be done. Thus, for M=2, 
the peak result rate will be half the peak hardware rate. 

The problem here is that the data structure is "fractured". Indeed only 1 out of each M 
elements is active. 

Perhaps it would be better to get the active elements together. For this particular 
kernel, we should investigate the compress. 

(2.26) Def. Compress - Given i. A source data vector pointed to by AD, ie. 

AD > A v A 2 , A3, . . . , A|y, 

And a bit vector pointed to by B1TD, ie. 

BITD > b v b 2 , b 3 , b N 

Given ii. A target data vector pointed to by BD, ie. 

BD > B 1 B, 

The statement 

BD - Q8VCMPRS (AD,BITD; BD) 

will cause those elements of the vector that 
correspond to 1's in the bit vector to be 
compressed into the B vector. 

Concomitantly there is an operation called expand which allows one to put the answers 
back into the proper places in the array. 

The alternative to the control store algorithm is to compress the X values into a 
temporary vector, do full efficiency vector arithmetic and expand the answers. In 
comparing the two approaches we shall assume that N is large enough to ignore startup 
time. We also note that the pair 

YD = YD + A(K) 

YD = YD *XD 

is a linked triad and can be calculated at one cycle per result per pipe on the CYBER 
203E. 

The following chart contains the cycle count (20 nsec. cycles) for the stream time for 
the two approaches on various architectures. 
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(2.27) 



RECALL N =■ LENGTH OF ARRAY 

M = DISTANCE BETWEEN ACTIVE ELEMENTS 
L = DEGREE OF POLYNOMIAL 





CYBER 203 


2 PIPE CYBER 203E 


4 PIPE CYBER 203E 


CONTROL 
STORE 


3NL 


f(L + 1, 


in*,. 


COMPRESS 


— ¥ 


N + N(L+1) 
2M 


N ( N(L+1) 
2 4M 



For M >1, as L gets larger it is clear that the compress gets more efficient. "This is 
true since the- compress-expand time can be amortized over more arithmetic 
operations, whereas the redundant calculation in the control store technique varies 
directly with the total arithmetic operation count. In particular, one can construct the 
following table of crossover values, i.e., those values of L at which the 
compress-expand technique becomes superior for fixed values of M. 



(2.28) 



M - 2 

M =■ 5 
M - 10 



STAR 100 



L - 6 

L = 4 
L - 3 



2 PIPE CYBER 203E 



L = 3 
L = 2 
L - 2 



4 PIPE CYBER 203E 



L = 3 
L = 2 
L = 2 



FinaUy, while the exact algorithm we have analyzed is not that common, it is 
iUustrative of a large class of syntactic kernels, i.e., those that require much 
intermediate arithmetic on non-contiguous data. Neither the polynomial form nor the 
periodicity of data were essential to the discussion. 

Another data motion primitive is the merge. 

(2.29) Def. Merge - Given i. Two source data vectors pointed to by AD and BD. 

And a bit vector pointed to by BITD. 

Given ii. A target vector pointed to by CD, the statement 

CD - Q8VMERG(AD,BD,BITD; CD) 



will examine consecutive bits in the bit vector. If the 
bit is 'V, the next unused element of the A vector will 
be placed in the next contiguous element in the C vector. 
Concomitantly, if the bit is '0', the next unused element 
of the B vector will be placed in the C vector. 



14 



For example, if 

AD>A 1) A 2> A3,A 4 

BD>B 1 ,B 2 ,B3,B 4 ,B5 

BITD>100 111000 
Then after merging, 

CD> Ax, Bi, B 2 , A 2 , A3, A 4 , B 3 , B 4 , B 5 

This operation allows one to vectorize kernels with IF tests by both the control store 
and compress-merge approaches. 

Consider the following calculation tree 

IF TEST 
ARITHMETIC <X >Q ARITHMETIC 




Where the arithmetic depends on the results of an IF test, the whole tree being inside a 
DO loop. 

The IF test translates to a vector relational creating a bit vector. Then one can do all 
the arithmetic in both branches using the bit vector to control store results; or one can 
use the bit vector to compress out active data, do full efficiency vector arithmetic on 
the active data, and conclude with a merge back into the original data structure. 

The relative efficiency of one approach over the other depends on such factors as 
vector length and the amount of arithmetic in each branch. 

There are two limitations to the control store and compress-merge techniques 

1. The output data stream elements are in the same order as the input data, i.e., 
if Aj occurs before A^ in a data structure before compressing or merging, the 
same will be true after the given operation. 



15 



2. The time to process a control store, compress or merge is proportional to the 
total data structure size. For very sparse active data this exacts a high price. 

Let us illustrate the second point with an example that more properly belongs to the 
next section. 

Let A be an N X N matrix stored by columns, i.e., standard FORTRAN storage. 
Suppose we wish to operate on a row of the matrix, say to equilibrate it. One could 
compress out every Nth element from the N 2 elements. This would take (N 2 ) cycles 
even though we seek only N elements. 

To overcome these limitations, there is the capability of scatter/gather, 

(2.30) Def. Gather — Given: A source data vector pointed to by AD, an index list, ie., 

an integer data vector pointed to by ID and a target data 
vector pointed to by BD, the statement 

BD - Q8VGATHR (AD, ID; BD) 

will use the integers pointed to by ID as indices to take 
possibly disparate elements from the A vector and make 
them contiguous in the B vector. 



Essentially, it accomplishes the following: 



(2.31) 



DO 9000 J=1,N 
B(J) - AO(J)) 
9000 CONTINUE 
and does it in (N> cycles. 



Concomitantly, there is Q8VSCATR to accomplish 
(2.32) A(l(J)) - B(J) 



The constant of proportionality in the O(N) time to gather or scatter changes radically 
with succeeding architectures 



(2.33) 



ARCHITECTURE 


TIME PER FETCH 


RATIO 


STAR 100 MICRO-CODED INSTRUCTION 
CYBER 203 SCALAR L/S ALGORITHM . 
CYBER 203E MICRO-CODED INSTRUCTION 


800 ns 

100 ns 

30 ns 


1 

8 

26 
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This tremendous efficiency increase will be needed as we move to three-dimensional 
data structures, gathering and scattering two-dimensional and one-dimensional 
substructures. 

For our final kernel in this section, let us consider a subroutine to calculate a 
piecewise linear function on a vector of data. * 

Let f(nA) = Y(n+1) n = 0,N and for nA < X < (n+1) A 

f(X) = Y(n+l) + (X-nA) (Y(n+2)-Y(n+D) 

A 



We wish to calculate Z(J) = f(X(J)) J=1,N where the values of X(J) are not in any 
pre-assigned order. Thus, we must gather the ordinate values "to the left and right". 



Y<1) 




If we multiply the X values by 1/A, the resulting vector elements wiU have an integral 
part which is equal to nA such that nA<X<(n+l)A and hence the fractional part is 
(X-nA). 

Propitiously, the operation "greatest integer less than or equal to" is one vector 
instruction on the CYBER 200. Hence, the algorithm can be completely vectorized. 

1. Multiply X vector by 1/A 

2. Find "floor" of resulting vector and use as index list for gather 

3. Subtract to get vector of (X-nA)'s 
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4. Gather left hand ordinate values 
Gather right hand ordinate values 

5. Evaluate f(X) with simple vector arithmetic 

Further syntactic details of data motion can be found in (3) and (5). 



3. VECTOR NUMERICAL LINEAR ALGEBRA 

Matrices can be divided into four classes depending on the structure of sparsitv i e 
the number and placement of zeros. 

1. Full matrices - Those matrices where one may not assume any given element 
is a priori zero 

2. Large banded - Those matrices that are NxN such that there exists P<<N 
with A k}J - =!0 for |k-j | _>P., Pictorially such matrices look like 




where the hatched-in region contains the non-zeros. Usually P-^QD as N -»0O. 



3. Structured sparse - Those matrices that are N x N with 0(N) non-zeros placed 
in a structured pattern, usually along a fixed bounded number of diagonals. 
Pictorially such matrices look like 
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4. Random sparse - Those matrices that are N x N with 0(N) non-zeros placed in 
a random pattern. 

The various sparsity regimes just outlined demand widely differing data structures for 
optimal execution of various numerical linear algebraic algorithms on a vector 
•processor. While this fact is true and is being recognized within the computing 
community, there is also a belief that this is not true for scalar processors. Before we 
get on to the proper discussion of this section, let me address this topic. 

It is not true that one can ignore data structure on a scalar processor and get optimal 
performance. There are basically two reasons why this misbelief is widely held. 

1. The difference in performance on a scalar processor between "casual" 
FORTRAN and optimal FORTRAN is usually much less than an order of 
magnitude. Thus the most one can procure for one's efforts is usually a 
factor of 2 or 3. 

2. The indexing nomenclature of FORTRAN would lead one to believe that 
accessing a multi-dimensional data structure is painless regardless of the 
order of acquisition. 

The fact of the matter is that one can improve scalar performance by carefully 
indexing within a data structure. But the improvement to be attained is not nearly so 
pronounced as on a vector processor. But rather than saying that a scalar processor 
can execute the primitive operations (such as loading, arithmetic, storing, etc.) as well 
for structured sparse problems as for full problems, it is true that to a certain extent 
it executes the primitive operations for a full problem as poorly as it does for a 
structured sparse problem. 

The final truth is that for a contiguous data structure, the processor can maximize its 
ability to exchange data with memory and process it. The vector hardware of the 
CYBER 200 is designed around this basic truth. 

We shall discuss various fundamental algorithms for matrices in sparsity regimes #1-3. 
Even on scalar processors, regime #4, the random sparse regime, needs very careful 
attention. As of yet it is not possible to vectorize operations in this regime to the 
same extent as the other three. 

Let us begin with matrix addition and multiplication for full matrices. 

If two full matrices are stored in the same order as contiguous data structures, we 
have already seen that the matrix sum, being an element by element sum can be 
accomplished with one instruction provided the size is 255x255 or less. However, even 
for a 1000x1000 case one would need only 16 instructions. In particular, for columnar 
or row storage, (2.13) and (2.14) illustrate this fact. 
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Matrix multiplication is not quite so trivial a matter. Let us recall the definition of 
the product of two NxN matrices: 

Given full matrices A,B both NxN then for C = AB, 

N 
Cj k = 2 Aj! B lk 

1=1 

If one examines (3.1) one can see that the j,k element of C is the inner (or DOT) 
product of the jth row of A and the kth column of B. This naturally leads, to our first 
algorithm. If A is stored by rows and B by columns, the product matrix can be formed 
in any storage fashion with N 2 inner products. -~ — — 



(3.1) 



\f v\ 



A j1; A j2' 



*JN 



Ik 
J 2k 



A 



Bi 



Nk 



On the CYBER 203, the timing for the inner product instruction is 

12N + 260 (20 nsec.) cycles 
while for the CYBER 203E it is projected to be 

N + 0(1) cycles. 
Thus, the total time for the first algorithm is 

,CLE COUNT , N 3 + q(n2) cyber 2Q3 

CYBER 203 
2 PIPE CYBER 203E * + 0(N 2 ) CYBER 203E 

4 PIPE CYBER 203E 

a a is not stored by rows or B by columns, one must gather the appropriate entities. 
This will affect only the 0(N 2 ) term of course. However, one should not expect to find 
this particular storage order too often. One would expect that both would be stored 
one way or the other. 
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Is it possible to find an algorithm that dispenses with the necessity of the gathers and 
yet is efficient? Indeed there is. Not only is it possible to do away with the gathers, 
but asymptotically this second algorithm we shall consider will be four times faster 
than the first algorithm !! 



It is. based on the linear algebraic fact that 



(3.2) 




A11 



s 1k 



+ '• B 



V 



k IM1 




+< B 



Nk 



A 1N \ 

\*tmj 



This leads to the so called "outer product" algorithm, namely: 
If A is stored by columns, B is stored in any fashion then 



(3.3) 



DO 9000 K - 1,N 
C(1 r K;N) - B<1,K) * A<1,1;N) 

DO 9010 L = 2,N 
C(1,K;N) - C(1;K;N> + B(L,K) » A(1,L;N) 
9010 CONTINUE 

9000 CONTINUE 



will compute C and store it by columns. 

Note «ytt the vector statement in the inner loop is a linked triad. 
AEAN VECTOR LENGTH 

The time for this algorithm is 

3N3 + 0(N 2 ) CYBER 203 



N 3 /2 + 0(N 2 ) 
N 3 /4 + 0(N 2 ) 



CYBER 203E two pipe 
CYBER 203E four pipe 



Not only is this algorithm asymptotically four times faster than the first algorithm, 
but the 0(N 2 ) is much smaller due to the fact that no data motion is required. 

The reader versed in linear algebra will immediately see that the dual form of (3.2) 
which expresses the Kth row of C as a linear combination of the rows of B yields an 
algorithm identical in performance to (3.3). This algorithm demands that B be stored 
by rows and calculates C by rows, but places no restrictions on A. 
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To sum up, the following table presents the necessary data structures to achieve the 
matrix operations listed with the mean vector length given on various architectures. 

(3.4) 





FULL MATRIX 

ADDITION 

C - A + B 


FULL MATRIX 

MULTIPLICATION 

C = AB 


DATA STRUCTURE 


A rows A cols. 
B rows B cols. 
C rows C cois~~ 


A cols. A * 
B » B rows i 
.,: C cols. Crows ' 


- - 


N 2 .... 


_._ N .._- 




N 2 + 0(1) 
N 2 /2 +0(1) 
N 2 /4 + 0(1) 


3N 3 + 0(N 2 ) 
N 3 /2 + 0(N 2 ) 
N 3 /4 + 0(N 2 ) 



Since the mean vector length for the multiplication algorithms is the length of a 
column of the matrix, it is clear that the data for the entire matrix need not be stored 
contiguously, but merely the data for the columns (or rows). In particular, if an 
augmented matrix is stored by rows, this will not affect the efficacy of the algorithm. 

Next let us discuss the operations of matrix addition and multiplication for the other 
two sparsity regimes. 

We shall start with the structured sparse, since it will allow us to introduce a new 
storage order. 

For a structured sparse matrix, and in certain circumstances for a large banded 
matrix, the preferred storage order for matrix multiplication is diagonal storage. The 
number of non-zeros in each row or column is small, usually less than ten. Hence there 
is no way to get efficient vectorization with this sparsity. However, the number of 
elements in the non-zero diagonals varies from N to N-P+l. This is indeed more like 
what we seek. Can this storage order give rise to an efficient algorithm, though? The 
answer is yes. The details are left to the reader who may consult (7). We shall 
illustrate this fact with the product of two tridiagonal matrices. 
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These matrices are such that 

A K,J = unless |K-j| < 1. 

The product of two tridiagonal matrices is pentadiagonal. The main diagonal terms of 
the product are given by 



(3.5) 



C KK = A K,K-1 B K-1,K + A KK B KK + A K,K+1 B K+1,K K " 1 ' N 
where A 1(J » B Q1 = A N N+1 = B N+1 N - 0. 



as K runs from 1 to N 

C KK traverses Main diagonal of C 

A K,K-1 traverses Sub diagonal of A 

B K-1,K traverses Super diagonal of B 

A KK traverses Main diagonal of A 

B KK traverses Main diagonal of B 

A K,K+1 traverses Super diagonal of A 

B K+1,K traverses Sub diagonal of B 

Hence (3.5) directly translates into vectorized CYBER 200 FORTRAN. The exact 
form will depend upon how the descriptors for the various diagonals are set up. 

Large banded matrices lie somewhere between the extremes of full and structured 
sparse. We shall see that both of the storage orders we have discussed (row-column, 
diagonal) can be efficiently applied to this sparsity regime. If all we were interested 
in were addition and multiplication of matrices, the diagonal order would be 
preferred. However, usually one is interested in the solution of the linear system 
embodying the matrix. For this, row-column storage is preferred. Again, if the large 
banded matrix is formed from a product of, say tridiagonal matrices, the diagonal 
order is preferred. The individual choice must be made on the basis of more exact 
problem details. 

As for the storage orders themselves, it should be clear that the algorithm illustrated 
for the product of tridiagonal matrices goes over to large banded matrices. Indeed, 
diagonal storage can even be used for full matrices. The stream time for 
multiplication in this case will be the same as it was for the row-column algorithm, but 
the startup time will be longer. One can see this qualitatively by noting that each of 
the vectors in the row-column algorithm have length N while in the diagonal algorithm 
lengths will vary from one to N. 



23 



The row-column algorithm for full matrices depended upon having contiguous 
non^zeros in a given row or column. In the full case, there are N of them. In the large 
banded case, there are between P and 2P-1. Hence the row-column algorithm goes 
right over with a mean vector length of 2P-0(p2/N) which we assume is 2P-0(1). 
There is some ancillary bookkeeping to do to keep track of the top and bottom of the 
non-zeros, but no more so than for a scalar process. 

As for matrix addition, the fundamental fact still holds, namely, if the two matrices 
have their non-zeros stored contiguously in the same order, whether row, column or 
diagonal, the sum can be calculated with one instruction. 

Let us turn to the subject of linear equation solution?- There are two-basic approaches 
to this problem, the direct and the iterative. For full (and large banded) systems, we 
shall illustrate Gaussian Elimination and SOR, the most popular direct and iterative 
procedures. We shall see that each of these algorithms can be implemented on the 
CYBER 200 with great efficiency. 

For structured sparse systems, the implementation demands a bit more subtlety. We 
shall illustrate Red-Black SOR, a workable iterative scheme for certain structured 
sparse systems and cyclic reduction, a direct scheme for point and block tridiagonal 
systems. 

As with many direct methods, Gaussian Elimination is composed of two parts, the 
decomposition phase and the substitution phase. The decomposition phase consists of 
applying a succession of elementary transformations to the augmented matrix which 
result in upper semi-triangular form. The substitution phase consists in the solution of 
the derived upper semi-triangular system by uncoupling the variables one by one. 
Interchanging rows in the augmented matrix is an allowable elementary transformation 
and can be used to pivot. 

Let us dispense with the question of pivoting once and for all. We shall be concerned 
with implementation of Gaussian Elimination for both a rowlike and a columnar 
storage scheme for the matrix. 

Partial pivoting consists of two phases, the search and the row interchange. For 
rowlike storage, the search for the maximal element is a simple vector operation. 
Indeed the CYBER 200 has one instruction which searches for the maximum absolute 
value contained in a vector, returning the value and its location. The row interchange 
is accomplished with two gathers and two scatters. For rowlike storage, a gather is 
necessary before the search can be done, but the row interchange is a succession of 
simple vector moves. 

The decomposition consists of two parts, the multiplier formation and the reduction. 
These phases are repeated (N-l) times ..until the matrix is transformed into upper 
semi-triangular form. Le us illustrate the first occurrence of these phases. 
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Consider the matrix in original form: 
(3.6) , 

/ A, 



A 11 A 12 ' ■ ■ A 1N 



, 21 



\ 



Ani 



We wish to find a set of multipliers such that when the first row is multiplied by the 
Kth multiplier and added to the (K+l)st row, the (K+l)st element in the first column 
will become zero. Provided that An is not zero, it is clear that 

(3,7) * A 21 /A 11< - a 31 /a iv • ■ ■ • - A N1 /A ir 

is such a set of multipliers. The problem of A n being zero or even very small is part 
ol the problem of pivoting which we have already considered. 

Thus, for the multiplier formation, the columnar storage is favored since (3.7) can be 
formed with one vector multiplication without the need for a gather. 

Before we discuss the reduction phase, let us distinguish two types of rowlike storage 
for the augmented matrix. Let M be the number of right hand sides to be solved for 
The most common values for M are one and N. When M = N we are usually solving for 
the inverse of the matrix A. When M = N we will consider two cases of rowlike 
storage. In the standard case, the matrix A and the matrix of right hand sides will be 
considered to be separate. In the augmented case, the augmented matrix A:Y wiU be 
stored by rows. Thus the following elements will be contiguous: 

Ak,1 A K ,2 ». A K , N Y K ,l Y K>2 ..Y K ,m 

The first reduction phase entails multiplying the first row by the succession of 
multipliers and adding the results to the succession of rows below. 

This has traditionally been presented as a rowlike algorithm, i.e., one takes one 
multiplier, say (-A 2 i/A n ) and multiplies the whole row A 12 ...A 1N and adds to the 
whole row A 2 2 ••• A 2N* 

Indeed, this operation is a linked triad on the CYBER 203E, and if we have augmented 
rowlike storage, the entire row of A and Y-can be treated with one vector instruction. 
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However, when one investigates the total work to be done in treating the minor of the 
augmented matrix, there is no reason why one cannot use all the multipliers, multiply 
by one element in the first row and add this vector to the respective column in A 
Similarly for Yik. " * 



(3.8) 




*1K 



WkA 



Thus, the reduction phase is just as much a columnar algorithm. Perhaps the reason 
why this a posteriori obvious fact has not been more generally recognized is that 
traditionally numerical analysis texts have considered matrices as rowlike objects (It 
is curious that given this fact, FORTRAN standards treat matrices as columnar objects 
when mapping them onto essentially one-dimensional memories.) 

Since the decomposition phase has an 0(n3) operation count, while the substitution 
phase has an O(N^) count, if M, the number of right hand sides, is small then 
decomposition will clearly be the dominant phase. If, however, M = N then 
substitution also becomes an 0(N3) phase. Thus we will also consider the vectorization 
of substitution. 

First, let us summarize the vectorizability of decomposition: 
(3.9) 





Column Storage 


Row Storage 


Augmented 
Row Storage 


Multiplier 
Formation 


Vectors — Mean 

Length N/2 


Gather 
Necessary 


Gather 
Necessary 


Reduction 

M » 1 


Vectors — Mean 
Length 2N/3 


Vectors — Mean 
Length 2N/3 


Vectors — Mean 
Length 2N/3 + 1 


Reduction 

M =■ N 


Vectors — Mean 
Length 2N/3 


Vectors — Mean 
Length 5N/6 


Vectors — Mean 
Length 5N/3 
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Thus for M = 1 columnar storage is clearly superior, while for M = N, the extra time to 
gather is more than offset by the longer mean vector length during reduction. Of 
course, when M = N, we must consider the substitution phase before deciding since in 
this case the two phases have comparable operation counts. 

At the conclusion of the decomposition, the matrix is in upper semi-triangular form. If 
the multipliers have been saved, then the LU decomposition has been affected. 

The substitution phase is recursive in that one solves for 

X NK = IM K=1,M 

A NN 

and then 

X N -1,K= _! (Yn-1,K-A N -i, N X N)K ) K=l,Metc. 

A N-1,N-1 

This presents no problem for rowlike storage for M large. Indeed, the two equations 
above represent simple vector operations of length M. The problem of M=l or column 
storage seems to leave us with an inner product algorithm, i.e., 

XN ' L= a* : T (Y N-L " A N-L,N Xn " An-L,N-1 x N-1 •••) 

A N-L,N-L 

However, for column storage, if we observe carefully, it is possible to perform the 
substitution phase using vector linked triads. 

The key insight is that when X N is solved for, its affect on all the variables 
Xj, .... Xjj-i can be "put back" into the equations. This can best be seen by looking at 
the calculational tableau for a small case, say N=4. 

(Y 4 ) 

(Y3-A34X4) 

(Y2-A24X4-A23X3) 

(Yx - A 14 X 4 - A 13 X 3 - A 12 X 2 ) 



x 4 


: 1 
A44 


x 3 


1 
A33 


x 2 


1 
^22 


Xl 


1 
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After solving for X4, one can multiply the column A14.A24.A34 by X4 and subtract 
from the column Y\, Y%, Y3. This is a linked triad and when completed, the tableau is 



X 3 
X 2 
Xi 



A33 

1 
A 22 



(Y 3 ) 
(Y2-A23X3) 

(Y 1 -A 13 X3^A 1 2X 2 ) 



The above process can be repeated until all X's are solved for. 
Let us then summarize the vectorizability of substitution: 
(3.10) 





Column Storage 


Row Storage 


M - 1 
M - N 


Vectors — Mean 
Length N/2 

Vectors — Mean 
Length N/2 


Inner Product 
Necessary 

Vectors — Mean 
Length N 



This only confirms what we discovered about decomposition, namely for M = 1 
columnar storage is clearly superior, whereas for M = N the situation is reversed, 
especially if augmented row storage is possible. 

We have been discussing alternative storage schemes to optimize the vectorization of 
Gaussian Elimination. But is it possible to make a choice in general? The answer to 
this question depends upon analysis of the whole program. In particular, if the matrix 
A is formed by scalar code, one can arrange its storage order as one pleases. Even in 
certain vectorized cases one can still choose. Such considerations are global in nature 
rather than local. To get the best performance, one cannot look at one DO loop or 
even one subroutine, but at the usage of the data structure as a whole. 

If we consider the foregoing discussion for the case of large banded systems, it is clear 
that for both storage orders the preceding analysis still holds, but with vector length P 
replacing N/2 and 2N/3 in the various phases. The efficiency in this case is determined 
by the size of P. 
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It is also clear the foregoing analysis cannot efficiently be applied to structured sparse 
systems. We shall return to this question a little later. 

Let us go back to a full system which we wish to solve iteratively. There are many 
methods which have been evolved to attack this problem. But there are several basic 
considerations which are common to most or all of these methods. We shall illustrate 
these basics, starting with the Jacobi and SOR algorithms for a full system. 

Since SOR differs from Gauss Seidel only in the addition of a multiple of the diagonal 
to the right hand side, without loss of generality we shall consider the Jacobi and 
Gauss Seidel algorithms. 

The basis for an iterative solution to AX = Y is to decompose A = M-N and attempt to 
find X such that MX = NX+Y. 

This "fixed point" is sought by making a guess X<°) and repeating the algorithm 
MX<n + l> = NXW + Y until X™ + V and X«" are very close in the appropriate norm. Note 
that the iterative technique trades solving AX=Y once for solving MX^+l) = NX^ + Y 
many times. 

Much of the variety of iterative methods comes from the various ways one can form 
M,N such that M-N = A. At the heart of many of these is the following additive 
decomposition of A 

A = A D + A L + Au 

Where AD is the diagonal portion of A, A L is the lower triangular portion and Att is 
the upper triangular portion. 

In terms of the notation above, the Jacobi iterative scheme is characterized by 

Jacobi M = Arj N = -(Al + Ay) 

This method typifies those methods that do not update variables until the end of a 
given iteration. It is also well known that for a large class of problems it is the 
slowest converging method in terms of the number of iterations necessary for 
sufficient accuracy. It is not our business here to discuss this. Our object is to discuss 
the vectorization of the scheme. 

For full A, Jacobi vectorizes with length N. All one has to do is have a copy of the 
matrix A with the main diagonal zeroed out. Then the multiplication of -(Al + Au) 
times X w is multiplication of a full matrix times a vector which we already know can 
be vectorized with a linked triad algorithm with vector length N. One simply adds Y 
and the right hand side is formed. - ' - 

The solution of AdX^ +1 ) = RHS is a trivial vector multiply by the vector of reciprocals 
of diagonal elements of A. 
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By contrast the Gauss Seidel iterative scheme is characterized by 

Gauss Seidel M = Aq + Al N = -Ay 
This method typifies those that do update variables as soon as a new value is available. 
As for the vectorization, evaluation of the right hand side, 

-AyXM+Y 
Vectorizes to a linked triad algorithm with mean vector length N/2. 
The solution of 

(A D +A L )X<n + l) = RHS 

is recursive, but the form of the recursion is the same as the substitution phase of 
Gaussian Elimination. Hence it vectorizes to a linked triad algorithm with mean 
vector length N/2. 

As before, these same algorithms work efficiently for large banded systems with 
suitable modifications and large enough P. 

As for structured sparse systems, the Jacobi algorithm easily transports as long as 
-(Al + Au) is stored by diagonals. However, the Gauss Seidel algorithm does not go 
over. 

In summary, for the algorithms we have just presented 
(3.11) 





Jacobi or JOR 


Gauss Seidel or SOR 


System 


Vectors — Mean 
Length N 


Vectors — Mean 
Length N/2 


Large 

Banded System 


Vectors — Mean 
Length 2P 


Vectors — Mean 
Length P 


Structured 
Sparse System 


Vectors — Mean 
Length > N-P Q 


Not Vectorizable 



where Pq is the non-zero diagonal furthest-frbrrf the main diagonal/ 

It is fair to say that we have rather definitively treated the vectorization of the linear 
equation solution techniques just discussed, at least when the problem is main memory 
contained, for full and large banded systems. It is also fair to point out that except for 
the Jacobi iterative algorithm we have not given any results for structured sparse 
problems. 
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In a certain sense, there are no definitive answers here. That is not to say that there 
have not been algorithms developed, but that they are special case developments, 
geared to specific problems. Thus this is a very active area of research. 

We will close with two specific problems, one to illustrate the vectorization of an 
interative method for a structured sparse problem and the other to illustrate a direct 
method. 

In general, one does not use SOR type algorithms for full systems of equations. The 
typical usage of this scheme is for structured sparse problems arising in the solution of 
boundary value problems of partial differential equations. 

Suppose one wishes to solve Poisson's equation 

V 2 «k = p 

with Dirichlet boundary conditions on a rectangle. Using a uniform mesh, one can 
consider the rectangular region as a sequence of nodes. Replacing v 2 by the sum of 
second order central differences in the X and Y directions, the partial differential 
equation is replaced by a system of linear equations: 



x 
x 

X 
X 
X 



X 
X 
X 
X 
X 



X 
X 
X 
X 
X 



X 
X 
X 
X 
X 



(3.12) 



H,k - 2 *j,k + 



j+1.k 



>Lk+1 ■ 2 *j.k + *j.k-1 



p lk 



Numbering the nodes in lexicographic order 

1,1 2,1 N,l 1,2 ... N,N , 
the system of linear equations has the form 
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(3.13) 




<t> » Y 



where Y is formed from the boundary values of <t> and the values of p . This is an 
archtypal structured sparse system. 

There are five non-zero diagonals, with three of them clustered around the main 
diagonal and the other two spaced apart from these by a distance equal to the number 
of nodes in one direction on the grid. This sparsity pattern defies vectorization with 
the algorithms introduced so far. 

The key fact needed to vectorize this problem is that the matrix above is a 
consistently ordered 2-cyclic matrix. 

One might know the 2-cyclic nature of this matrix from the fact that it is block 
tridiagonal, but this fact is geometrically recognizable if we go back to the mesh and 
ask ourselves what the SOR method implies. 

Each iteration, one passes over an the nodes, updating the value for <t> from those 
values "around it". If we order the nodes lexicographically, each node is updated from 
two values at the last iteration and two values of the given iteration. The recursion is 
obvious. What if we reorder the nodes, dividing them into two classes, say the red and 
the black. 



(3.14) 
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It is geometrically clear that the red nodes are updated from the black nodes and 
conversely. Thus, with this ordering, it should be possible to vectorize the SOR. 



If we go back to the algebraic problem and reorder the nodes so that all the black 
nodes precede the red ones, the system assumes the form 



(3.15) 



u 



\ 



\ r-A m 



J WW 



where Di and D2 are diagonal matrices, B and C are structured sparse matrices, <t>\ 
and <t>2 are the <t> values on the black and red nodes, respectively. 

The Gauss Seidel algorithm applied to this reordered system yields: 



(3.16) 



n <b (n+1) 



B <t>, 



(n) 



8 <n+1) - y 2 . c ^(n+1) 



Note that the structured sparse matrices are now on the righthand side, not the 
lefthand side. Thus, we have traded equation solution with a structured sparse system 
for matrix multiplication by a structured sparse matrix. This latter operation is 
eminently vectorizable if B and Care stored by diagonals. 

For a general discussion of SOR methods see (9) and (10). For implementation 
information on a red-black problem see (8). 

Finally, we must address the question of convergence. When we rearrange the order of 
the variables in a scheme such as SOR which updates as soon as possible, there is at 
least the possibility that the convergence rate may be altered, if not destroyed. 
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However, in this ease it is known that both the lexicographic order and the red-black 
order are consistent orders, and thus the spectral radius of the respective SOR 
iterative matrices are equal which implies that the asymptotic convergence rates are 
equal. Of course this does not guarantee that one will see identical convergence rates 
for a finite number of iterations. Experiments in a large number of cases has 
demonstrated heuristically that actual rate of convergence for the red-black ordering 
is no worse than the lexicographic. Indeed, in most cases it has been slightly better 
from the standpoint of number of iterations. Of course the speedup in time per 
iteration is phenomenal. 



Finally, it should be pointed out that the same red-black ordering may be applied to 
three-dimensional problems. (Indeed it can be applied- to one-dimensional problems as 
well.) Also, replacing the Laplacian with a more general elliptic operator does not 
affect the sparsity considerations, but may affect the actual convergence rate. 

We close with an example of vectorized direct approach to a structured" sparse 
problem, namely point cyclic reduction applied to a tridiagonal system. 

Recall that a tridiagonal system is one in which all elements except for the three 
diagonals about the main diagonal are zero. A 5x5 tridiagonal matrix has the form: 



(3.17) 



o 






u 1 

B 3 

















Matrices such as this arise in many applications, especially in alternating direction 
implicit methods. 

There are several popular methods for solving such systems on a vector processor. 

While Gaussian Elimination becomes a scalar recursive algorithm due to the sparsity 
pattern, it can still be done efficiently for moderate size systems on the CYBER 200 
by using the large register file and tuning the scalar code accordingly. 
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One can think of a tridiagonal system as arising from lexicographic ordering for a 
one-dimensional mesh. Thus the red4>lack ordering scheme discussed previously can be 
employed. J 

However, cyclic reduction is the most efficient method for large systems. 

The basis for the cyclic reduction algorithm on tridiagonal systems is the following set 
of observations. ^ 

1. Consider three consecutive rows of the matrix with an even row in the middle 

2K-1 O O . . . O X X X O O 0~ . . . e 

2K 00... OOX XXOO... O 

2K+1 00...000XXX0...0 

The 2Kth row couples X 2K with X 2K+1 and X 2K _ 2 . If we add the proper 
multiples of rows (2K-1) and (2K+1) to row 2K, we can eliminate this coupling 
at the cost of coupling X 2K with X 2K+2 and X 2K _ 2 . Thus 



2K 00 ox®x(x)xo... 

2. If we carry out this procedure for all even rows 

(i) We have uncoupled aU the even variables from the odd variables. 

(ii) The resulting system for the even variables is again tridiagonal. 

dii) The odd equations provide a diagonal system for solving the odd 

variables in terms of the even variables. 

By observation (ii) we can apply cyclic reduction again and again, each time 
halving the size of the system remaining. 

Can we vectorize this algorithm, and what kind of data storage will it take to 
accomplish it? ^ 
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If one looks at the multipliers needed to accomplish the first 
(3.18) 



stage reduction: 



row 1 + row 2 + - — =. 



row 3 



row 2 



- -— * J row 3 + row 4 + I - — i 



row 5 



row 4 



cUldrmp^rourthJ^en S?£2 cfX^Sft *T ^ * di ™' °" e 
apply the results to the diagonauTstored odf I's^nd n'r d ° V ^ ° r division md then 
form for the even rows: odd Bs and C's, resulting in the following 



(3.19) 



row 2 



row 4 



A 2 + C n 




-T* 



A 4 + C 3 



.-J 



+ B C 



th h e e r^Sgo^ s^rSe^I/ 8 ™ ^ At — S ^ *• - of 

X vSSabtTth^^^^ fotd° r ftr^ Ch "*»« « "** ^ N = 8. 
below the line which have "Lady S foTd: ^^ SySt6m inV ° lvin * the variable * 

(3.20) 

3 

2 5 

4 6 7 



8 



4 2 

8 4 

- 6 
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Thus at each stage of the substitution one must merge variables just solved for in order 
to prepare for the next level. As for the vectorizability of the diagonal system 
solution, if one considers the last diagonal system for N = 5, 

(3.21) t 



x 1 


= 


A 1 


< Y 1 


- C 1 


x 2 ) 


><3 


sat 


1 
A 3 


< Y 3 


" C 3 


X 4 - B 3 X 2 ) 


X 5 


3 


1 
A 5 


< Y 5 




"- % x 4 ) 



we see that the diagonal storage scheme aUows this to be completely vectorized. 

Experiments on the STAR 100 (6) have shown that cyclic reduction is superior to 
Gaussian Elimination (essentially scalar for tridiagonal systems) for N >150. There is 
no doubt that this crossover number rises on the CYBER 203, due to its enhanced 
scalar performance. Tests are being conducted at the time of this writing. However 
if one recalls the projected speedups in the compress and merge operations on the 
CYBER 203E (Specifically, see (2.27).) it is clear that the cross over point will droD 
dramatically on this architecture. 



4. CONCLUSION 

By no means have we exhausted the material that could have been presented in section 
3. In particular, no mention was made concerning the problems of eigenvalue and 
eigenvector solution. This will be left for a later publication. 

Finally one should not forget that the CYBER 200 is a powerful scalar processor as 
weU. An excellent example of a dual vector-scalar algorithm for Bunemann's variant 
ot block cyclic reduction can be found in (4). 

We have introduced the concept of vector processing as embodied on the CYBER 200 
We have endeavored to show that the CYBER 200 hardware has the primitive 
operations, supported by FORTRAN, to translate many syntactic kernels into single 
instructions. More importantly we have endeavored to show that semantic 
vectonzation is possible and that vector numerical analysis is being developed to 
provide the algorithm designer with the insights that will enable a new plateau of 
calculational ability to be reached. .. • - 
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INTRODUCTION 

The invention of the stored program computer along with the 
construct of high level languages has revolutionized Numerical 
.Analysis- This revolution has spread through much of Applied 
Mathematics as well as some aspects of Pure Mathematics- The 
essence of this revolution is the capability of carrying out 
computations easily and efficiently which are beyond the realm 
of possibility of a human with pencil and paper. 

For example, Kawaguti in nS3 published results for the classic 
Fluid Dynamics problem of two-dimensional flow past a circular 
cylinder based on approximately 10 S min- of calculation using 
a mechanical desk calculator. Recently B- Fornberg of the 
California Institute of Technology has published further re- 
sults on this problem for higher Reynolds number- His esti- 
mate for the problem solution on a standard commercial scalar 
processor of the present is ID 2 min- 

It is by now widely appreciated that the difference between 10 S 
min. of human resource and even 10 5 min. of computer time rep- 
resents a revolution. A year and a half of a person's life is 
an incredible price to pay for one set of numbers. If as 
Hamming has saidi " The object of computing is insight, not 
numbers"-, then one must consider Kawaguti's attempt a valiant 
failure in the art of computing. 

„ COMPANY PRIVATE 
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Although the 10 3 min- is a feasible execution time, it is not good 
enough- If one were experimenting with the various parameters of 
this problem, such as mesh size or outflow boundary conditions, 
even 10 E begins to loom large- These considerations are mostly 
quantitative- Of far more qualitative interest is the fact that 
as the Reynolds number increases-, the computations needed to 
solve even one case become enormous. In particular, up to the 
present, there were no reliable results for Reynolds number above 
200- For all the power of scalar computers, solutions in this 
regime are beyond them- What we need is another revolution. That's 
what we've got!! Fornberg has just published computationally re- 
liable results for the flow past cylinder problem with Reynolds 
number 30Q. Such a solution takes under 1 min. on the CDC STAR- 
100. This two order of magnitude increase in computational effic- 
iency represents just as nuch an advance over the scalar computer 
as the scalar compjter represents an advance over Kawaguti's heroic 
attempt. And yet it is only the beginning. Even at Reynolds number 
300 one cannot decide between the Brodetsky vs. the Batchelor con- 
jectures concerning the nature of such flows as the Reynolds num- 
ber increases. 

Where do we go from here? What is the qualitative essence of this 
revolution? What problems can we expect to solve with this new 
tool? What problems can we not expect to solve with this new 
tool? It is our objective here to at least develop a vocabulary 
so that we can ask these questions in an intelligent manner. The 
answers, hopefully, will keep us happily and gainfully employed 

for the near future- _..„.--......«. 
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How is the vector processor qualitatively different from the 
scalar processor? The scalar processor treats each individual 



calculation as a new adventure- Hence, within the limits of 
memory banking, a scalar processor can treat data that is not 
contiguous as well as it can data that is contiguous. Perhaps 
a better way to put this fact is that the scalar processor treats 
contiguous data as poorly as it does non-contiguous data. The 
vector processor, on the other hand, treats assemblages of data 
as units- By "taking the larger view" it is possible to gain 
internal parallelism and optimize the ha-duare within the pro- 



cessor ■ 



It. is not our object here to detail the internal architecture 
of the CYBER 200 nor to debate its merits vs- other vector 
architectures, nor even to demand that a system like the CYBER 
200 must be 100* utilized to be cost effective- Some users will 
be content with attaining that portion of the system's power 
which is achievable with present programming practices- They 
will see performance improvements due primarily to the large 

,, • l!M -ps, and secondarily to 
memory and scalar processing capaoil it ies ° y 

..... T i ■ ,j»antaae of only a portion 
vector processing aoility. Taking advance* / h 

. ,-l... .. _ ; «r a common occurence for 
of a large computer's architecture is a 

at least some portion of each installation s *or 
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Today I address those in. the supercomputer community who a-. 

interestedin assaulting the next plateau of computational 
•bility. The success on failure of a given tool in this en- 
deavor will depend as much on how intelligently it is usedl 
as how intelligently it is designed. What we „ust ask is whether 
a given design evinces the promise of allowing human intelligence 
to use it to its fullest. Thus we will limit ourselves to a phen- 
omenological description of CYBER 200 performance and how it applies 
to some common linear algebraic algorithms. 

To quantify the performance of the CYBER ?00 architecture let us 
introduce the following: 

DEF. nEGAFLOP {abbreviated .flop henceforth>: the ability to do 
one million floating p oint operations in one second. 

As a scale of reference, the CDC 7b00 is a 3-10 .flop computer 
depending upon the particular algorithm and its implementation. 
The key fact, however, is that this range of performance can be 
considered 0O> mflop. Several present vector processors, in- 
cluding the STAR-100 are capable of D<in> mrinn T 

v "ie or u-UiUJ mflop. In contrast t 

this, consider the following hardware rates for the CYBER 5DS , 
an enhanced version of the Control Data CYBER 203 . This compete 
can do arithmetic in both 3, and ui-bit modes and has either two 
or four vector pipelines- 

COMPANY PRIVATE 



o 



PSI EXCERPTS September, MflO 

Paye- lMfl-35 

CDC CYBER 170/70/bOOO/gQO 

VECTOR PROCESSING - PROBLEK OR OPPORTUNITY -C cont'd} 



TABLE I 

CYBER gP5 ' ADD OR MULTIPLY flEGAFLOP RATE 
Column A is' for the two pipe architecture and tM-bit operands* 
.Column B is for the two pipe architecture and 32-bit operands or 

the four pipe architecture and bM-bit operands- 
Column C is for the four pipe architecture and 32-bit operands. 

VECTOR 

LENGTH II C 

2S 20. D 22-2 23-1 

SO 33-3 10-0 »43.? 

b4 3fl.b m. & 54-2 

100 50.0 bb-7 7A.M 

500 83.3 1H2-1 220-3 

1000 10-0 lbb-7 264. 1 

10000 «H.O 11t>.l 364-3 

soooo «n.a m-2 3it>.6 

100 200 MOO 

It is clear that here is a tool almost orders of magnitude better 
than the best scalar computer of today, and alaost an order of 
magnitude better than the present class of vector processors- And 
this is only the beginning. For certain alg° r i thrBSl t^ei r ex- 
pression as data structures leads to linked triads defined bslo- 1 - 
TABLE II details the hardware performance of such triads on the 
CYBER EQS, -[Columns A, B and t mean the sarne. as in TABLE L> 

COMPANY PRIVATE 
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DEF. LINKED TRIAD - a triadic combination of two vectors and one 
scalar that can be evaluated as one operation. 

The two most important examples of this structure are: 

Vector + Scalar*Vector 
■CVector+Scalar} * Vector 



TABLE II 
CfB€R_?0S -' LINKED TRIAD HEGAFI OP Bur 



VECTOR 
LENGTH 

2S 

50 

b4 

100 

SOD 

1000 

10000 

SO000 



A 


8 


25. b 


27.4 


45-5 


51-3 


54.7 


t.3.4 


74.1 


10-1 


14T.3 


235-1 


170.'! 


2SA.5 


nt.7 


3flb.fi 


1ST. 3 


3=n.3 


200 


400 



c 

2fl.4 
54. fl 
bfl-fl 
102. b 
331-0 
4?b-2 
74 s }. 1 
76=1.3 
flOO 



To complete this description of the CYBER 5 0S capability, let 

us also mention that the full vector processing capability is 

sustainable while up to lb channels are bursting at 200 megabits 

«ch. Real memory will be available with up to 4 million words- 

Last but not least, Control Data\xpects to complete the first 
CYBER 20S< in „ aQ . , e are thus not taik . ng a ^ t sQ ^ .^^ 

<r« B based upon as yet unrealized breakthroughs- Ue are talking 
3b0Ut a real " m Puter. COMPANY PRIVATE 
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•C c on t ' d> 



Indeed, if 0ne uould Hke fc( 



CDC 



speculate about the near f ut 



is even now designing computers fc 



urei 



°r Che «iddl 8 60»s with 
peak performance in the CK1000>mflop range. 



Uhat 



we have described so fa 



raw power. Can it be harna: 



r is a calculational engine 



of immense 



•ssed to perform useful work without 
s^ious degradation, Spending upon your viewpflinti ^ ^^ 
poses a problem or provides an opportunity. 

To see this, let us perfQrm ^ following ^^ .^.^^ 
Suppose our processor has the capability of performing calculations 
Simultaneously in the vector and scalar processor. <The CYBER 3Q5 

has this capability}. Uithaur r-„„, j «• 

v wicnout rega-d for the possible logical de- 

pendencies, given that the uor^. 

6 Vector Processor is one order of magni- 
tude faster than the scalar nn n , -^ • 

Cdiar Processor, it is necessary f r lo* of 

the calculations to be donp i „ «-u 

aone in the vector processor in order not to 

be "scalar" bound. Uith a ♦.,.„ 

men a two orders of magnitude ratio, the per- 
centage climbs to ll* and for- t-u- _, -#■ 

° ror three orders of aajmtude 1*1. "i*. t 

this attainable? Herein lies t- ho u, ~~ +u 

nes the problem, cr the opportunity. 

How can we solve the oroblom ^ 

P &lem or take advance of the opportunity? 

Let us first outline one mpf-hr.w.,1 -■*■:•», 

ne methodology that definitely, positively will 

not work- 



The general philosophy of thi 



s methodology i s t0 take all the accu- 



mulated prejudices that h^un l , • 

H j e tnac nave been acquired working with previous 

generations of computers anri . . j,„* »*.« 

m ers and use them to judge th e possibility f 

attaining the Performance i isted - n JABL£S I and II. 

COMPANY PRIVATE 
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^ PraCUcal -PP»«tio„ of this philosophy is 

"ving to neW plateaus of solvina „ • *' 

solving physics problems «,. 
Problems, or mathematics problems , peering 

- — ^ :;::;:r;r ic p ^— 



o 

es. 



Once you have gotten this fa- ■„ 

s ra ' ■• y° u are trapped. Th* „ 7 
-t this point is to deny the info ,- ' "^ ^ 

y Cne formation theoretic ,-„ k 
the second law of thpr m nH __eor e tic counte rmart of , 

% — - ——____ ' thermodynamics. A* ;,„ - . 

c^sT^T^To^ ri • " 0ri91nal Pr ° blem is »»c 

y transformed, lt ls being filt . redi 

computer code- Each filter w lf,ally a 

'— S a-, left £ o t „, scaIar ' -.t h „ etic 

se8 «,.„ jl: e 7 - ot ^ " - • >— Ik .- Ptic t0 

syntactic vector izat ion of n? «f 
problem indeed. ' * "^ iS * f °™idable 

Instead of thinking f all this as a probl , 

° f it as an oppctunity th ' "^ <"""»' 

hpo. curuty, the opportunity to «„> 
the range of the seal, to solve problems beyond 

ac ui cne scalar and vector 



a researcher in fl - H M Pressors of today. Recently 

c*er m fi uid dynamics was ai y 

-. !-....« ..,.,„. W0D equstions ln ; 7 — " ««.- 

°— Pipe cveER , 05 Uhen •••• -"--..• *.s ,. cond , 

uvl sed of this f^rh u 
far away i 00k in h . niS fdctl he got that 

y ^ook m his eyes. What new S u a 

processor open up , dVenues 0f th °^t does this 

Pt ' up COMPANY Private 
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>s is not a 



Uel1 ^ don't run off to the ivory tow er . 

unconstrained n^n • '"*" * a noc * " s e of 

ined opti BI2 ation. Let's try to n ,.. 

approach to this «n™ ► ■ outline a positive 

? this opportunity that sati S f ies tha 
5traints. ■ es the n ee«ssary con- 



FirStl U is «l«r that not every probIo 

super vector Droc Sf1 °° ld be "^.d on a 

vetcor processor. Uhsfhor „ 

•"lecner or not n,. « 

"ices even the scalar c „ *' rMl m9mor * 

ne scalar component worth the Dr i ce „, ... 

computer i s not th. th * yhole 

is not the point. u e rMuljl<li 

^nd or desktoo i ,. ' * *' Pr ° bl * as wit * 

Desktop instruments tha- nn,-„ 

»** ^ , stUl hd ° nCe WePa """"". Problem. 

- still have *ainfra me s, Obviously beCduse ^ 
1-rged our problem nori20n , '"^ *« h *" .n- 

as w e have enlarged our «■«.,„ * • 
"Polity. I.,., „« „„ p ^ «"<"«'-9 

class In* „ ri ««"tly on .aclun.s „, th < 

ciass. Indeed-, a ma ior m- =0s 

J co^onent of the viability of a neu 
P^r in the „a rketplace is . «"•««.- 

*« ability to give a quantua „«. » 
ance gain with purely syntax • Perform- 

y synta c t lc transition. But when wo „ 
a machine with a peak stPM you deal with 

Peak stream rate of 800 .flop, &Q mflQB > 

look all that impressive a„ P J ° St do «n't 

h =".ve any mor e . 

What have we concluded so f ar : 

1- Syntactic conversion of r „ 

co °es win gain js respectable aiW 
ance but will not do i n th P forsa - 

cn e long run- 
s' Components of existing 

•a c °aes that can cake advantage of 
super vector processor's .«T 

stream rate should be identifier! 
to aid in the transition. ~ 

3. Mew programs should be do, , 

eveloped in ter« of algori Cn ~ s th 
map well onto the sup e ~ "' 

Sector processor. 
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various applicable disciplines- While the acre subtle results of 
this activity are yet to be determined, a good place to start is 
with 

VECTOR NUMERICAL LINEAR ALGEBRA 

.The algorithms of Linear Algebra provide a convenient starting 
point for constructing our desired algorithm class. flatrix multi- 
plication and linear equation solution a-e not only important in 
real world applications, but they have also achieved a certain 
status as benchmark problems. {In some ways, our analysis is 
paralleled by that of <3> in the construction of the 8LAS. How- 
ever, we differ in that in terms of our algorithm criteria, the 
BLAS a.-e too low level, too general {and yet too specific}, and 
too syntax oriented. They may be fine on a scalar processor, but 
will not do on the super vector processor}. 

life' shall limit ourselves her e to brief examples of several ele- 
mentary operations and how present knowledge of their proper im- 
plementation on the CYBER 20S does or does not fit our criteria. 



Let us first establish a vocabulary of matrix spa.sity. A matrix 
is considered full if a priori all elements aust be considered 
non-zero. A matrix is considered la-ge banded iff all elements 
outside the diagonals P dia gondls from the »ain diagonal are zero 
and all elements inside these diagona is are considered non-zero. A 
matrix is structured sparse iff the n0 n-zeros lie along a fixed 
relatively small number of diagonals- <Thereis a fourth class 
which consists of those matri Ces that have 0{'J> non-zeros in a 
completely random pattern. u e shaH n0t consider these}. 
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*•- «- .«.., «... < s . e U) ; it ; r " i " st -° " «-'•««. »iu... 

pei i or ma nee • 

'-■• >«-, -co,.. and . diagonal „ refe ;; e 7"- «■ «>• «,. 

given matrix. ths storage order of a 



J 



TA8LE III 



OPERATION 



DATA 

11RUCTURE 



MEAN 



LINKED 
TRIAD 



C=AS 
FULL 



LARGE BANDED 

STRUCTURED 
SPARSE 



A col. A 

B or B row 

C col . C row 

Ditto 



A diagonal 
B diagonal 
C dia gona 1 



5P 



lengths 
of 

diagonals 



yes 



YES 



NO 



AX=Y 
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REDUCTION 



FULL 


A 


row or 


col- 


2N/3 


LARGE BANDED 


A 


row or 


col • 


P 


SUBSTITUTION 










FULL 


A 


row 




N/2 




A 


col . 




N/2 


LARGE BANDED 


A 


col • 




P 




A 


row 




P 


FULL GAUSSIAN 










INVERSION 










ax-v 










REDUCTION 


A 


row or 


col • 


2N/3 



YES 

YES 

INNER PRODUCT 
YES 

YES 

INNER PRODUCT 



YES 

AY auamentsd 

r QUS SN/3 YES 

SUBSTITUTION A row N YES 

A col. N/2 YES 

SOR on full or la-ge bandes systems uses the satrix cult i pi icat ion 
and substitution algorithms listed above- 

STRUCTURED SPARSE -CS0R} 
as applied to an 
N 3 mesh to 



solve red-black 

elliptic 8V node order 

pr ob 1 em 



N 3 /2 NO 
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CYCLIC REDUCTTniu 




SYSTEn 
AX=Y 



A diagonal 



N/log N 



NO 



>«t us examine th. algorithms in TA8LE III in thp . . h 
five criteria. F irst th ' llght ° f tne 

for specialty, admittedly this i« „ 

•r~,h '3 needed. This uiu 

11 *m not happ 8n unf;i t-i, 
, nnn . , HP LmCU th e "aforementioned 

1«P » Closed t0 connect co „ putationa , 

-'-■»'«» •"«■ T his is the ... o f th . Contr 0r ' 9,nal 

research in fh- Control Data sponsored 

;p;— .».n.tH„.. lgorithms are not 

m 7 UP01nt "- "— ^ - On the other hand, they -p . 

high level enough to allow a super vector „ 
5QS nn( . , processor like the CYBER 

205 not to get bogged down, i.e., to aUou .. 
'to .! , the actua l Performance 

to closely match the theoretical performance t ' 

, Bnnhh _. rormance for a given vector 

l^gth. f 1IHllyi uithin fcne limita 

... n or rea l memory!, each of 

these algorithms i s monotonical ly m0 r e e ff iri - ► 

size increases. """lent as the problem 



In conclusion, the super vector processor „ 

opportunitv to • " ^ "" un P-^edented 

opportunity to move into new calculations i „ • 

nal r egimes 1 provided we 
approach it intelligently. Part of this i „, ,, • 

"'"- <- - • — - >* ,„:*;';;" is the r °- 9 - 

-»« - insist a,o„ 9 „ Ith p hysUa, ,„ „ , ; UM """" '" '- 
uity o„H „at„,„at, , ■ '"tuition, eng.n.erin, i„ gen _ 

'■ u md *-nemai.ical rigor. 

COMPANY PRIVATE 



